Preparations

Load the necessary libraries

library(gbm)         #for gradient boosted models
library(car)
library(dismo)
library(pdp)
library(ggfortify)
library(randomForest)
library(tidyverse)
library(gridExtra)
library(patchwork)

Scenario

Verneaux (1973) measured the abundance of 27 fish species and 11 environmental variables from 30 locations along the Doubs river which runs along the France-Switzerland border. The environmental data comprised a range of hydrology, geomorphology and chemistry parameters and amungst other things, the ichthyologist was interested in relating fish abundances to the environmental drivers.

The data are in two files.

  • verneaux.fish.csv - the abundance of 27 fish species
  • verneaux.env.csv - the environmental data
Variable Description
DAS Distance from source (km)
ALT Altitude (m above sea level)
PEN Slope (per thousand)
DEB Mean minimum dischange (m3.s-1)
PH Water pH
DUR Calcium concentration (hardness) (mgL^-1)
PHO Phosphorus concentration (mgL^-1)
NIT Nitrate concentration (mgL^-1)
AMM Ammonium concentration (mgL^-1)
OXY Dissolved oxygen (mgL^-1)
BDO Biological oxygen demand (mgL^-1)

Read in the data

fish = read_csv('../public/data/verneaux.fish.csv', trim_ws=TRUE)
glimpse(fish)
## Rows: 30
## Columns: 27
## $ CHA <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 3, 3, 2, 1, 1, 0, 0, 0, 0, …
## $ TRU <dbl> 3, 5, 5, 4, 2, 3, 5, 0, 0, 1, 3, 5, 5, 5, 4, 3, 2, 1, 0, 0, 0, 0, …
## $ VAI <dbl> 0, 4, 5, 5, 3, 4, 4, 0, 1, 4, 4, 4, 5, 5, 4, 3, 4, 3, 3, 1, 1, 0, …
## $ LOC <dbl> 0, 3, 5, 5, 2, 5, 5, 0, 3, 4, 1, 4, 2, 4, 5, 5, 4, 3, 5, 2, 1, 1, …
## $ OMB <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 2, 0, 1, 1, 0, 0, 0, 0, …
## $ BLA <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 4, 5, 2, 1, 1, 0, 0, 0, …
## $ HOT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 3, …
## $ TOX <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 3, 3, 2, 2, 2, …
## $ VAN <dbl> 0, 0, 0, 0, 5, 1, 1, 0, 0, 2, 0, 0, 0, 0, 3, 5, 3, 2, 2, 2, 2, 3, …
## $ CHE <dbl> 0, 0, 0, 1, 2, 2, 1, 0, 5, 2, 1, 1, 0, 1, 3, 2, 2, 3, 1, 3, 2, 4, …
## $ BAR <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 3, 3, 2, 4, 4, 5, …
## $ SPI <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 3, 2, 3, 2, 1, …
## $ GOU <dbl> 0, 0, 0, 1, 2, 1, 0, 0, 0, 1, 0, 0, 0, 1, 2, 2, 1, 2, 4, 4, 5, 5, …
## $ BRO <dbl> 0, 0, 1, 2, 4, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 2, 3, 3, …
## $ PER <dbl> 0, 0, 0, 2, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 1, 2, 3, 4, …
## $ BOU <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 3, 3, 3, …
## $ PSO <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, …
## $ ROT <dbl> 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, …
## $ CAR <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 3, …
## $ TAN <dbl> 0, 0, 0, 1, 3, 2, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 4, 4, 4, …
## $ BCO <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 3, 4, …
## $ PCH <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, …
## $ GRE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 4, …
## $ GAR <dbl> 0, 0, 0, 0, 5, 1, 0, 0, 4, 0, 0, 0, 0, 0, 0, 1, 2, 2, 5, 5, 5, 5, …
## $ BBO <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, …
## $ ABL <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 3, 5, 5, 5, …
## $ ANG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, …
env = read_csv('../public/data/verneaux.env.csv', trim_ws=TRUE)
glimpse(env)
## Rows: 30
## Columns: 11
## $ DAS <dbl> 0.3, 2.2, 10.2, 18.5, 21.5, 32.4, 36.8, 49.1, 70.5, 99.0, 123.4, 1…
## $ ALT <dbl> 934, 932, 914, 854, 849, 846, 841, 792, 752, 617, 483, 477, 450, 4…
## $ PEN <dbl> 48.0, 3.0, 3.7, 3.2, 2.3, 3.2, 6.6, 2.5, 1.2, 9.9, 4.1, 1.6, 2.1, …
## $ DEB <dbl> 0.84, 1.00, 1.80, 2.53, 2.64, 2.86, 4.00, 1.30, 4.80, 10.00, 19.90…
## $ PH  <dbl> 7.9, 8.0, 8.3, 8.0, 8.1, 7.9, 8.1, 8.1, 8.0, 7.7, 8.1, 7.9, 8.1, 8…
## $ DUR <dbl> 45, 40, 52, 72, 84, 60, 88, 94, 90, 82, 96, 86, 98, 98, 86, 88, 92…
## $ PHO <dbl> 0.01, 0.02, 0.05, 0.10, 0.38, 0.20, 0.07, 0.20, 0.30, 0.06, 0.30, …
## $ NIT <dbl> 0.20, 0.20, 0.22, 0.21, 0.52, 0.15, 0.15, 0.41, 0.82, 0.75, 1.60, …
## $ AMM <dbl> 0.00, 0.10, 0.05, 0.00, 0.20, 0.00, 0.00, 0.12, 0.12, 0.01, 0.00, …
## $ OXY <dbl> 12.2, 10.3, 10.5, 11.0, 8.0, 10.2, 11.1, 7.0, 7.2, 10.0, 11.5, 12.…
## $ DBO <dbl> 2.7, 1.9, 3.5, 1.3, 6.2, 5.3, 2.2, 8.1, 5.2, 4.3, 2.7, 3.0, 2.4, 3…

Data preparation

For this example, we are going to focus on total fish abundance and therefore we need to sum the total number of fish within each location (rows).

Exploratory data analysis

Fit the model

Explore relative influence

Explore partial effects

Explore accuracy

Explore interactions

Tuning

Random Forest

library(randomForest)
fish.rf = randomForest(fish ~ DAS + ALT + PEN + DEB + PH + DUR + PHO + NIT + AMM +
                           OXY + DBO,
                       data=env, importance=TRUE,
                       ntree=1000)
fish.imp = randomForest::importance(fish.rf)
## Rank by either:
## *MSE (mean decrease in accuracy)
## For each tree, calculate OOB prediction error.
## This also done after permuting predictors.
## Then average diff of prediction errors for each tree
## *NodePurity (mean decrease in node impurity)
## Measure of the total decline of impurity due to each
## predictor averaged over trees
100*fish.imp/sum(fish.imp)
##         %IncMSE IncNodePurity
## DAS  0.10514608     19.578297
## ALT  0.08822959     15.527627
## PEN  0.03681748      7.102072
## DEB  0.08942111     18.600003
## PH  -0.01288507      1.065126
## DUR  0.03792810      8.052133
## PHO  0.05512223      5.574289
## NIT  0.06791615      8.942640
## AMM  0.05865963      5.675618
## OXY  0.04979067      4.982862
## DBO  0.05360180      4.269585
varImpPlot(fish.rf)

## use brute force
fish.rf %>% pdp::partial('DAS') %>% autoplot

fish.rf.acc <- env %>%
    bind_cols(Pred = predict(fish.rf,
                             newdata=env))

with(fish.rf.acc,  cor(fish, Pred))
## [1] 0.9843458
fish.rf.acc %>%
  ggplot() +
  geom_point(aes(y=Pred,  x=fish)) +
  geom_point(data = fish.acc, aes(y=Pred,  x=fish), colour = 'red')